Abstract. We introduce into the classical perceptron algorithm with margin a
mechanism that shrinks the current weight vector as a first step of the update.
If the shrinking factor is constant the resulting algorithm may be regarded as a
margin-error-driven version of NORMA with constant learning rate. In this case
we show that the allowed strength of shrinking depends on the value of the
maximum margin. We also consider variable shrinking factors for which there is
no such dependence. In both cases we obtain new generalizations of the
perceptron with margin able to provably attain in a finite number of steps any
desirable approximation of the maximal margin hyperplane. The new approximate
maximum margin classifiers appear experimentally to be very competitive in
2-norm soft margin tasks involving linear kernels.



1 Introduction 

It is widely accepted that the generalization ability of learning machines improves 
as the margin of the solution hyperplane increases [23]. The simplest online 
learning algorithm for binary linear classification, the perceptron [18,13], does 
not aim at any margin. The problem, instead, of finding the optimal separating 
hyperplane is central to Support Vector Machines (SVMs) [23,2]. 

SVMs obtain large margin solutions by solving a constrained quadratic op- 
timization problem using dual variables. In the early days, however, efficient 
implementation of SVMs was hindered by the quadratic dependence of their 
memory requirements on the number of training examples. To overcome this 
obstacle decomposition methods [17, 7] were developed that apply optimization 
only to a subset of the training set. Although such methods led to consider- 
able improvement the problem of excessive runtimes when processing very large 
datasets remained. Only recently the so-called linear SVMs [8,9, 14] by making 
partial use of primal notation in the case of linear kernels managed to successfully 
deal with massive datasets. 

The drawbacks of the dual formulation motivated research long before the 
advent of linear SVMs in alternative large margin classifiers naturally formulated 
in primal space. Having the perceptron as a prototype they focus on the primal 
problem by updating a weight vector which represents their current state whenever
a data point presented to them satisfies a specific condition. By exploiting
their ability to process one example at a time¹ they save time and memory and
acquire the potential to handle large datasets. The first algorithm of the kind is 
the perceptron with margin [4] the solutions of which provably possess only up to 
1/2 of the maximum margin [11]. Subsequently, various others succeeded in ap- 
proximately attaining maximum margin by employing modified perceptron-like 
update rules. For ROMMA [12] such a rule is the result of a relaxed optimiza- 
tion which reduces all constraints to just two. In contrast, ALMA [6] and much 
later CRAMMA [21] and MICRA [22] employ a "projection" mechanism to re- 
strict the length of the weight vector and adopt a learning rate and margin 
threshold in the condition which both follow specific rules involving the number 
of updates. Very recently, the margitron [15] and the perceptron with dynamic 
margin (PDM) [16] using modified conditions managed to approximately reach 
maximum margin solutions while maintaining the original perceptron update. 

A somewhat different approach from the hard margin one adopted by most of 
the algorithms above was also developed which focuses on the minimization of the 
regularized 1-norm soft margin loss through stochastic gradient descent. Notable 
representatives of this approach are the pioneer NORMA [10] and Pegasos [19]. 
Stochastic gradient descent gives rise to perceptron-like updates an important 
ingredient of which is the "shrinking" of the current weight vector. Shrinking is 
always imposed when a pattern is presented to the algorithm with it being the 
only modification suffered by the weight vector in the event that its condition 
is violated and as a consequence no loss is incurred. The cumulative effect of
shrinking is to gradually diminish the impact of the earlier contributions to the 
weight vector. Shrinking has also been employed by algorithms which do not 
have their origin in stochastic gradient descent as an accompanying mechanism 
in perceptron-based budget scenarios for classification [3] or tracking [1]. 

Our purpose in the present work is to investigate the role that shrinking of 
the weight vector might play in large margin perceptron learning. This is moti- 
vated by the observation that such a mechanism naturally emerges in attempts 
to attack the 1-norm soft margin task through stochastic gradient descent. If 
we accept that algorithms like NORMA succeed in minimizing the regularized 
1-norm soft margin loss they should be able to solve the hard margin problem as 
well for sufficiently small non-zero values of the regularization parameter which 
also controls the strength of shrinking. Thus shrinking, as weak as it may be, 
when introduced into the perceptron algorithm with margin might prove bene- 
ficial. Another factor to be taken into account is that the shrinking mechanism 
in the algorithms considered here is operative only for erroneous trials, a fea- 
ture that offers them the possibility to terminate in a finite number of steps. 
Therefore, shrinking in such algorithms may need to be strengthened relative to 
algorithms like NORMA to compensate for the fact that the latter shrink the 
weight vector even when the condition is violated. In conclusion, the amount
of shrinking that a perceptron with margin could tolerate without it destroying
the conservativeness of the update might be sufficient to raise the theoretically
guaranteed fraction of the maximum margin achieved to a value larger than 1/2.
It turns out that this is actually the case.

¹ The conversion of online algorithms to the batch setting is done by cycling repeatedly
through the dataset and using the last hypothesis for prediction.

The remainder of this paper is organized as follows. Section 2 contains some
preliminaries and a description of the algorithms. In Sect. 3 we present a theoret- 
ical analysis of the algorithms. Section 4 is devoted to implementational issues 
and a brief experimental evaluation while Sect. 5 contains our conclusions. 



2 The Algorithms 

Let us consider a linearly separable training set $\{(x_k, l_k)\}_{k=1}^{m}$, with
vectors $x_k \in \mathbb{R}^d$ and labels $l_k \in \{+1, -1\}$. This training set may be
either the original dataset or the result of a mapping into a feature space of
higher dimensionality [23, 2]. By placing $x_k$ in the same position at a distance
$\rho$ in an additional dimension, i.e., by extending $x_k$ to $[x_k, \rho]$, we construct
an embedding of our data into the so-called augmented space [4]. This way, we
construct hyperplanes possessing bias in the non-augmented feature space.
Following the augmentation, a reflection with respect to the origin of the
negatively labeled patterns is performed. This allows for a uniform treatment of
both categories of patterns. Also, $R \equiv \max_k \|y_k\|$ with
$y_k \equiv [l_k x_k, l_k \rho]$ the $k$-th augmented and reflected pattern.
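The augmentation and reflection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `augment_and_reflect` and the two toy points are hypothetical and not part of the paper.

```python
import math

def augment_and_reflect(X, labels, rho=1.0):
    """Build y_k = [l_k x_k, l_k rho] for every pattern and R = max_k ||y_k||."""
    Y = []
    for x, l in zip(X, labels):
        # Augment x_k with the extra coordinate rho, then multiply by the
        # label so that negatively labeled patterns are reflected.
        Y.append([l * v for v in list(x) + [rho]])
    R = max(math.sqrt(sum(v * v for v in y)) for y in Y)
    return Y, R

# Hypothetical toy input: one positive and one negative pattern in 2 dimensions.
Y, R = augment_and_reflect([[3.0, 4.0], [-1.0, 0.0]], [+1, -1], rho=1.0)
```

After this step a weight vector $a$ classifies pattern $k$ correctly exactly when $a \cdot y_k > 0$, regardless of the label.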

The relation characterizing optimally correct classification of the training
patterns $y_k$ by a weight vector $u$ of unit norm in the augmented space is

$$u \cdot y_k \ge \gamma_d \equiv \max_{u' : \|u'\| = 1} \min_i \{u' \cdot y_i\} \quad \forall k . \qquad (1)$$

We shall refer to $\gamma_d$ as the maximum directional margin. It coincides with the
maximum margin in the augmented space with respect to hyperplanes passing
through the origin. The maximum directional margin $\gamma_d$ is upper bounded by
the maximum geometric margin $\gamma$ in the non-augmented space and tends to it
as $\rho \to \infty$ [20].

We consider algorithms in which the augmented weight vector $a^s_t$ is initially
set to zero, i.e. $a^s_0 = 0$, and is updated according to the perceptron-like rule

$$a^s_{t+1} = c^s_t a^s_t + \eta y_k \qquad (2)$$

each time the "misclassification" condition

$$c^c_t \, a^s_t \cdot y_k \le b \qquad (3)$$

is satisfied by a training pattern $y_k$, i.e., whenever a margin error is made on $y_k$.
Here $0 < c^s_t, c^c_t \le 1$ are "shrinking factors" which may vary with the number $t$ of
updates, $\eta > 0$ is a constant learning rate and $b > 0$ acts as a margin threshold
in the misclassification condition. If we set $c^s_t = c^c_t = 1$ we recover the classical
perceptron algorithm with margin. The role of $c^s_t$ is to shrink the current weight
vector as a first step of the update, thereby enhancing the importance of the
current update relative to the previous one. At the same time such a shrinking
acts as a mechanism of effectively increasing the margin threshold of the
condition, an effect that may be further strengthened through the introduction of
the factor $c^c_t$ in (3). In fact, for appropriate choices of $c^s_t, c^c_t$, to which we
confine our interest here, it is possible to equivalently introduce shrinking into the
perceptron with margin via a learning rate and margin threshold which both
increase with $t$. Notice that we denote by $a^s_t$ the weight vector of the algorithms
with shrinking to reserve the notation $a_t$ for the weight vector of the equivalent
algorithms with variable learning rate and margin threshold.

We investigate the impact of shrinking on large margin perceptron learning by
considering both constant and variable shrinking factors. If shrinking does not
depend on $t$ we set $c^c_t = 1$ since a constant $c^c_t$ may be absorbed into a
redefinition of $b$. We also express $c^s_t$ in terms of a "shrinking parameter"
$\lambda < 1/\eta$ as $c^s_t = 1 - \eta\lambda$. Then (2) becomes the update of NORMA for
$a^s_t \cdot y_k \le b$. NORMA, however, updates its weight vector even when
$a^s_t \cdot y_k > b$. In this case the update reduces to pure shrinking
$a^s_{t+1} = (1 - \eta\lambda) a^s_t$. This is the important difference from our algorithm
in which an update occurs only if the misclassification condition is satisfied,
thereby making convergence in a finite number of steps possible.

The Margin Perceptron with Constant Shrinking

Input: A linearly separable augmented dataset
S = (y_1, ..., y_k, ..., y_m) with reflection assumed
Fix: η, λ, b
Define: c^s = 1 − ηλ
Initialize: t = 0, a_0 = 0, η_0 = η, b_0 = c^s b
repeat
    for k = 1 to m do
        if a_t · y_k ≤ b_t then
            a_{t+1} = a_t + η_t y_k
            η_{t+1} = η_t / c^s
            b_{t+1} = b_t / c^s
            t ← t + 1
until no update made within the for loop

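Let us divide the update rule (2) with $(1 - \eta\lambda)^t$ and condition (3) with
$(1 - \eta\lambda)^{t-1}$. Also let $a_t \equiv a^s_t / (1 - \eta\lambda)^{t-1}$. Then, we obtain a
completely equivalent algorithm with update

$$a_{t+1} = a_t + \frac{\eta}{(1 - \eta\lambda)^t} \, y_k \qquad (4)$$

and condition

$$a_t \cdot y_k \le \frac{b}{(1 - \eta\lambda)^{t-1}} . \qquad (5)$$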
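As an illustration, the constant-shrinking algorithm may be sketched in Python. The sketch below is an assumption-laden toy, not the authors' C++ code: it works in the original shrinking form, i.e. it shrinks by $1 - \eta\lambda$ and then adds $\eta y_k$ whenever $a^s_t \cdot y_k \le b$, which is equivalent to (4) and (5) but avoids the growing factors $(1-\eta\lambda)^{-t}$. The names `mpcs` and `dot`, the `max_epochs` safeguard and the three toy patterns are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mpcs(Y, eta, lam, b, max_epochs=10000):
    """Margin perceptron with constant shrinking, original (shrinking) form.

    Y holds augmented and reflected patterns. Returns (a_s, number of updates).
    """
    c_s = 1.0 - eta * lam                 # constant shrinking factor
    a_s = [0.0] * len(Y[0])               # zero initial weight vector
    t = 0
    for _ in range(max_epochs):
        updated = False
        for y in Y:
            if dot(a_s, y) <= b:          # margin error on this pattern
                # shrink the current weight vector first, then add eta * y
                a_s = [c_s * ai + eta * yi for ai, yi in zip(a_s, y)]
                t += 1
                updated = True
        if not updated:                   # a full pass without updates: converged
            return a_s, t
    raise RuntimeError("no convergence within max_epochs")

# Hypothetical toy data: three patterns, already augmented and reflected.
Y = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
gamma_d = 3.0 / math.sqrt(2.0)            # maximum directional margin of this toy set
b = max(dot(y, y) for y in Y)             # b = R^2 = 8
eta = 0.1                                 # so that eta R^2 / b = 0.1
lam = 0.5 * gamma_d ** 2 / b              # half of the critical value gamma_d^2 / b
a_s, updates = mpcs(Y, eta, lam, b)
margin = min(dot(a_s, y) for y in Y) / math.sqrt(dot(a_s, a_s))
print(updates, margin / gamma_d)
```

On termination every pattern satisfies $a^s \cdot y_k > b$, and the printed ratio is the fraction of the maximum directional margin achieved by the zero-threshold hyperplane.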



An algorithm with variable shrinking is obtained if we choose
$c^s_t = c^c_t = (t/(t+1))^n$, where $n \ge 0$ is an integer. For $n = 1$ the shrinking
factor $c^s_t$ entering the update is the one encountered in Pegasos. Pegasos,
however, has variable learning rate, $c^c_t = 1$ and performs, just like NORMA, a
pure shrinking update when its condition is violated. In addition, its update ends
with a projection step. A variable shrinking factor $t/(t + \lambda)$ is also employed by
SPA [1] in which $b = 0$. Such a factor is related to ours for $\lambda = n$ even if
$n \ne 1$ since for $t \gg n$

$$\left(\frac{t}{t+1}\right)^n \simeq \frac{t}{t+n} .$$

Let us multiply both the update rule (2) and condition (3) with $(t+1)^n$ and
set $a_t \equiv t^n a^s_t$. Then, we obtain a completely equivalent algorithm with update

$$a_{t+1} = a_t + \eta (t+1)^n y_k \qquad (6)$$

and condition

$$a_t \cdot y_k \le b (t+1)^n . \qquad (7)$$

If we had chosen $c^c_t = 1$ we should have multiplied (3) with $t^n$. As a result the
threshold in (7) would have been $b t^n$, a difference that does not seem to be of
paramount importance. However, the choice $c^c_t = (t/(t+1))^n$ prevailed for the
sake of convenience. The choice, instead, $c^s_t = c^c_t = t/(t+n) = P^n_t / P^n_{t+1}$
with $P^n_t \equiv \prod_{k=0}^{n-1} (t + k)$ would have led to $a_t = P^n_t a^s_t$ and to the
replacement of $(t+1)^n$ with $P^n_{t+1}$ in (6) and (7).

We shall refer to the algorithm 
with update (4) and condition (5) 
as the margin perceptron with constant shrinking. The algorithm, instead, with 
update (6) and condition (7) will be called the margin perceptron with vari- 
able shrinking. The above formulations of the algorithms are the ones that will 
henceforth be considered in place of the original formulations of (2) and (3). 



The Margin Perceptron with Variable Shrinking

Input: A linearly separable augmented dataset
S = (y_1, ..., y_k, ..., y_m) with reflection assumed
Fix: η, b, n
Initialize: t = 0, a_0 = 0
repeat
    for k = 1 to m do
        t_n = (t + 1)^n
        b_t = b t_n
        η_t = η t_n
        if a_t · y_k ≤ b_t then
            a_{t+1} = a_t + η_t y_k
            t ← t + 1
until no update made within the for loop
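The variable-shrinking procedure can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it follows update (6) and condition (7) literally, and the names `mpvs` and `dot`, the `max_epochs` safeguard and the toy patterns are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mpvs(Y, eta, b, n, max_epochs=10000):
    """Margin perceptron with variable shrinking, update (6) and condition (7).

    Y holds augmented and reflected patterns. Returns (a, number of updates).
    """
    a = [0.0] * len(Y[0])                 # a_0 = 0
    t = 0
    for _ in range(max_epochs):
        updated = False
        for y in Y:
            t_n = (t + 1) ** n            # common scale of threshold and learning rate
            if dot(a, y) <= b * t_n:      # condition (7)
                a = [ai + eta * t_n * yi for ai, yi in zip(a, y)]   # update (6)
                t += 1
                updated = True
        if not updated:                   # a full pass without updates: converged
            return a, t
    raise RuntimeError("no convergence within max_epochs")

# Hypothetical toy data: three patterns, already augmented and reflected.
Y = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
gamma_d = 3.0 / math.sqrt(2.0)            # maximum directional margin of this toy set
b = max(dot(y, y) for y in Y)             # b = R^2 = 8
eta, n = 0.1, 3
a, updates = mpvs(Y, eta, b, n)
margin = min(dot(a, y) for y in Y) / math.sqrt(dot(a, a))
print(updates, margin / gamma_d)
```

Unlike the constant-shrinking variant, no knowledge of $\gamma_d$ enters the parameters here; `gamma_d` is used only to report the achieved fraction.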



3 Theoretical Analysis 

We begin with the analysis of the margin perceptron with constant shrinking. 

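Theorem 1. The margin perceptron with constant shrinking converges in a finite
number $t_c$ of updates satisfying the bound

$$t_c \le \frac{1}{\delta (1 - \epsilon)} \frac{R^2}{\gamma_d^2} \ln \frac{4 - (2 + \delta)\epsilon + \delta}{(2 + \delta)\epsilon - \delta} \qquad (8)$$

provided $\delta \equiv \eta R^2 / b \le 2$ and $\epsilon \equiv 1 - \lambda b / \gamma_d^2$ obey the constraint
$\delta/(2 + \delta) < \epsilon < 1$. Moreover, the zero-threshold solution hyperplane possesses
margin $\gamma'_d$ which is a fraction $f$ of the maximum margin $\gamma_d$ obeying the inequality

$$f \equiv \frac{\gamma'_d}{\gamma_d} \ge \frac{1 - \epsilon}{2} + \frac{1}{2 + \delta} . \qquad (9)$$

Finally, an after-run lower bound on $f$ involving the margin $\gamma'_d$ achieved, the
length $\|a_{t_c}\|$ of the solution weight vector $a_{t_c}$ and the number $t_c$ of updates is
given by

$$f \ge \frac{1 - (1 - \eta\lambda)^{t_c}}{\lambda (1 - \eta\lambda)^{t_c - 1}} \, \frac{\gamma'_d}{\|a_{t_c}\|} . \qquad (10)$$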



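Proof. Taking the inner product of (4) with the optimal direction $u$ and using
(1) we get

$$u \cdot a_{t+1} - u \cdot a_t = \frac{\eta}{(1 - \eta\lambda)^t} \, u \cdot y_k \ge \frac{\eta}{(1 - \eta\lambda)^t} \, \gamma_d$$

a repeated application of which, taking into account that $a_0 = 0$, gives

$$\|a_t\| \ge u \cdot a_t \ge \eta \gamma_d \sum_{k=0}^{t-1} \frac{1}{(1 - \eta\lambda)^k} = \frac{\gamma_d}{\lambda} \, \frac{1 - (1 - \eta\lambda)^t}{(1 - \eta\lambda)^{t-1}} . \qquad (11)$$

Here we made use of $\sum_{k=0}^{t-1} \alpha^k = (\alpha^t - 1)/(\alpha - 1)$. From (4) and (5)
we obtain

$$\|a_{t+1}\|^2 - \|a_t\|^2 = \frac{\eta^2}{(1 - \eta\lambda)^{2t}} \|y_k\|^2 + \frac{2\eta}{(1 - \eta\lambda)^t} \, a_t \cdot y_k \le \frac{\eta^2 R^2 + 2\eta (1 - \eta\lambda) b}{(1 - \eta\lambda)^{2t}} .$$

A repeated application of the above inequality, assuming $a_0 = 0$, leads to

$$\|a_t\|^2 \le \sum_{k=0}^{t-1} \frac{\eta^2 R^2 + 2\eta (1 - \eta\lambda) b}{(1 - \eta\lambda)^{2k}} = \frac{\left(1 - (1 - \eta\lambda)^{2t}\right) \left(\eta R^2 + 2(1 - \eta\lambda) b\right)}{\lambda (2 - \eta\lambda) (1 - \eta\lambda)^{2(t-1)}} . \qquad (12)$$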

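Comparing the lower bound on $\|a_t\|$ from (11) with its upper bound (12) we
get

$$\frac{\gamma_d^2}{\lambda^2} \, \frac{\left(1 - (1 - \eta\lambda)^t\right)^2}{(1 - \eta\lambda)^{2(t-1)}} \le \|a_t\|^2 \le \frac{\left(1 - (1 - \eta\lambda)^{2t}\right) \left(\eta R^2 + 2(1 - \eta\lambda) b\right)}{\lambda (2 - \eta\lambda) (1 - \eta\lambda)^{2(t-1)}} \qquad (13)$$

or, noticing that $1 - (1 - \eta\lambda)^{2t} = \left(1 - (1 - \eta\lambda)^t\right) \left(1 + (1 - \eta\lambda)^t\right)$, we obtain

$$1 - (1 - \eta\lambda)^t \le \left(1 + (1 - \eta\lambda)^t\right) A . \qquad (14)$$

Here

$$A \equiv \left(\frac{\lambda b}{\gamma_d^2}\right) \frac{\eta R^2 / b + 2(1 - \eta\lambda)}{2 - \eta\lambda} . \qquad (15)$$

The condition $A < 1$ ensures that (14) does lead to an upper bound on the
number of updates since otherwise (14) is always satisfied. This translates into
a very restrictive upper bound on the shrinking parameter $\lambda$. This upper bound
depends on the values of the remaining parameters but is never larger than $\gamma_d^2 / b$.
From (14), provided $A < 1$, we easily derive the following upper bound on the
number of updates

$$t \le t_b \equiv \frac{1}{\ln (1 - \eta\lambda)^{-1}} \ln \frac{1 + A}{1 - A} . \qquad (16)$$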



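For $\delta \le 2$ it holds that

$$\frac{\eta R^2 / b + 2(1 - \eta\lambda)}{2 - \eta\lambda} \le 1 + \frac{\delta}{2} \qquad (17)$$

and

$$A = (1 - \epsilon) \left(\frac{\delta + 2(1 - \eta\lambda)}{2 - \eta\lambda}\right) \le (1 - \epsilon) \left(1 + \frac{\delta}{2}\right) = 1 - \frac{(2 + \delta)\epsilon - \delta}{2} . \qquad (18)$$

As a consequence, $\epsilon > \delta/(2 + \delta)$ ensures that $A < 1$. In addition

$$\ln (1 - \eta\lambda)^{-1} \ge \eta\lambda = \delta (1 - \epsilon) \frac{\gamma_d^2}{R^2} . \qquad (19)$$

Combining (16), (18) and (19) we finally arrive at the slightly simplified upper
bound on the number of updates given by (8).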

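Upon convergence of the algorithm in $t_c$ updates condition (5) is violated by
all patterns. Therefore, the achieved margin
$\gamma'_d > b / \left((1 - \eta\lambda)^{t_c - 1} \|a_{t_c}\|\right)$. Thus,

$$f^2 = \frac{\gamma'^2_d}{\gamma_d^2} > \frac{b^2}{(1 - \eta\lambda)^{2(t_c - 1)} \|a_{t_c}\|^2 \gamma_d^2} \ge \frac{\lambda (2 - \eta\lambda) b^2}{\left(1 - (1 - \eta\lambda)^{2 t_c}\right) \left(\eta R^2 + 2(1 - \eta\lambda) b\right) \gamma_d^2}$$

$$\ge \frac{\lambda (2 - \eta\lambda) b^2}{\left(1 - (1 - \eta\lambda)^{2 t_b}\right) \left(\eta R^2 + 2(1 - \eta\lambda) b\right) \gamma_d^2} = \left(\frac{\lambda b}{\gamma_d^2 \left(1 - (1 - \eta\lambda)^{t_b}\right)}\right)^2$$

where use has been made of the upper bound (12) on $\|a_{t_c}\|$ and of the fact that
(13) at $t = t_b$ holds as an equality. Taking the square root and making use of
the definition of $t_b$ from (16) the previous inequality becomes

$$f > \frac{\lambda b}{\gamma_d^2} \left(\frac{1 + A}{2A}\right) = \frac{1}{2} \left(\frac{\lambda b}{\gamma_d^2} + \frac{\lambda b}{\gamma_d^2 A}\right) = \frac{1}{2} \left(1 - \epsilon + \frac{2 - \eta\lambda}{\delta + 2(1 - \eta\lambda)}\right) .$$

For $\delta \le 2$ the above inequality gives rise to (9) because of (17).

Finally, (10) is readily obtained if in the ratio $\gamma'_d / \gamma_d$ we employ the upper
bound on $\gamma_d$ derivable from (11). □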

Remark 1. The parameters $\delta$ and $\epsilon$ are independent. Therefore, we may consider
choosing $\delta \ll 1$ while keeping $\epsilon$ fixed. In this case the upper bound (8) on the
number of updates becomes $O(\delta^{-1} R^2 / \gamma_d^2)$ and from (9) the before-run lower
bound on $f$ approaches as $\delta \to 0$ the value $1 - \epsilon/2$. This generalizes the
well-known result that the classical perceptron algorithm with margin (obtainable
when $\lambda \to 0$ or $\epsilon \to 1$) has in the limit $\delta \to 0$ a theoretically guaranteed
before-run value of $f$ equal to 1/2. By subsequently letting $\epsilon \to 0$ (i.e.,
$\lambda \to \gamma_d^2 / b$) we may approach solutions with maximum margin.

Remark 2. To facilitate comparison with other large margin classifiers we may
relate the independent parameters $\delta$ and $\epsilon$ and obtain a single parameter
$\zeta < 1/\sqrt{2}$ through the relations $\delta = 2\zeta$,
$\epsilon = \delta(1 + \delta)/(2 + \delta) = \zeta(1 + 2\zeta)/(1 + \zeta)$. Then,
from (8) and (9) we have that the margin perceptron with constant shrinking
achieves "accuracy" $\zeta$, i.e.,

$$f \ge 1 - \zeta ,$$

in a number $t_c$ of updates satisfying the bound

$$t_c \le \frac{1 + \zeta}{2\zeta (1 - 2\zeta^2)} \frac{R^2}{\gamma_d^2} \ln \frac{1 - \zeta^2}{\zeta^2} .$$

Notice that the quantity $R/\gamma_d$ does not enter the logarithm. In this sense the
above bound, which is $O\left((\zeta^{-1} R^2 / \gamma_d^2) \ln \zeta^{-1}\right)$ for $\zeta \ll 1$, is the best among the
bounds of perceptron-like maximum margin algorithms. Typically, algorithms
which require at least an approximate knowledge of the value of $\gamma_d$ to tune their
parameters have bounds $O\left((\zeta^{-1} R^2 / \gamma_d^2) \ln (\zeta^{-1} (R/\gamma_d)^k)\right)$ with $k = 1, 2$ while
algorithms which do not assume such a knowledge have bounds $O(\zeta^{-2} R^2 / \gamma_d^2)$.
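To make the before-run guarantees concrete, the following Python sketch evaluates them numerically. It assumes that bound (8) reads $t_c \le \frac{1}{\delta(1-\epsilon)} \frac{R^2}{\gamma_d^2} \ln \frac{4 - (2+\delta)\epsilon + \delta}{(2+\delta)\epsilon - \delta}$ and that (9) reads $f \ge \frac{1-\epsilon}{2} + \frac{1}{2+\delta}$, as follows from the proof of Theorem 1. The function names and the numerical values are hypothetical.

```python
import math

def mpcs_bounds(delta, eps, R, gamma_d):
    """Return (upper bound on updates, lower bound on f) for constant shrinking."""
    if not (0.0 < delta <= 2.0 and delta / (2.0 + delta) < eps < 1.0):
        raise ValueError("need 0 < delta <= 2 and delta/(2+delta) < eps < 1")
    log_arg = (4.0 - (2.0 + delta) * eps + delta) / ((2.0 + delta) * eps - delta)
    t_max = (R / gamma_d) ** 2 * math.log(log_arg) / (delta * (1.0 - eps))
    f_min = (1.0 - eps) / 2.0 + 1.0 / (2.0 + delta)
    return t_max, f_min

def single_parameter(zeta):
    """Parametrisation of Remark 2: delta = 2 zeta, eps = zeta (1 + 2 zeta)/(1 + zeta)."""
    return 2.0 * zeta, zeta * (1.0 + 2.0 * zeta) / (1.0 + zeta)

# Hypothetical values: accuracy zeta = 0.1 on data with R / gamma_d = 10.
zeta = 0.1
delta, eps = single_parameter(zeta)
t_max, f_min = mpcs_bounds(delta, eps, R=1.0, gamma_d=0.1)
print(t_max, f_min)
```

With the single-parameter choice the guaranteed fraction collapses to $1 - \zeta$, which is a quick consistency check on the two formulas.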

Remark 3. Suppose we are given $\tilde{\gamma}_d < \gamma_d$. It may be expressed as
$\tilde{\gamma}_d = (1 - \xi) \gamma_d$. Setting $\lambda = (2/(2 + \delta)) \tilde{\gamma}_d^2 / b$ it holds that
$\epsilon = 1 - (2/(2 + \delta))(1 - \xi)^2 > \delta/(2 + \delta)$. Then (9) gives
$\gamma'_d / \gamma_d \ge 1 - \xi + \left(\xi^2 - \delta(1 - \xi)\right)/(2 + \delta)$. Thus, for
$\delta(1 - \xi) < \xi^2$ a solution with margin $\gamma'_d > \tilde{\gamma}_d$ is obtained which provides
a better lower bound on $\gamma_d$ than the one used as an input. A repeated application
of this procedure starting, e.g., with $\tilde{\gamma}_d = 0$, $\xi = 1$, $\lambda = 0$ gives solutions
possessing margin which is any desirable approximation of $\gamma_d$ without prior
knowledge of its value. An estimate of the quality of the approximation at each
stage may be obtained via the after-run lower bound (10) on $\gamma'_d / \gamma_d$ which provides
an upper bound on $\gamma_d$. In practice, only a few repetitions of this procedure are
required to obtain a satisfactory approximation of the optimal solution because
the margin actually achieved by the algorithm is considerably larger than the one
suggested by (9).
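The repeated procedure of Remark 3 can be illustrated numerically at the level of guarantees alone. The sketch below assumes that each stage achieves exactly the worst case allowed by (9), which with the choice of $\lambda$ above is $(1 + g^2)/(2 + \delta)$, where $g = 1 - \xi$ is the currently known fraction of $\gamma_d$. It does not run the algorithm itself, and the function name is hypothetical.

```python
import math

def guaranteed_fraction_sequence(delta, stages):
    """Worst-case fraction of gamma_d guaranteed after each stage, starting from xi = 1."""
    g = 0.0                                   # known fraction of gamma_d (g = 1 - xi)
    history = []
    for _ in range(stages):
        # before-run guarantee with lambda = (2 / (2 + delta)) * (g * gamma_d)^2 / b
        g = (1.0 + g * g) / (2.0 + delta)
        history.append(g)
    return history

delta = 0.01
history = guaranteed_fraction_sequence(delta, 200)
# Fixed point of the recursion: the smaller root of g^2 - (2 + delta) g + 1 = 0.
g_star = ((2.0 + delta) - math.sqrt((2.0 + delta) ** 2 - 4.0)) / 2.0
print(history[:3], g_star)
```

The worst-case sequence increases towards a limit that tends to 1 as $\delta \to 0$. In practice, as the remark notes, the margin actually achieved is larger, so far fewer stages are needed.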

Remark 4. From (12) we see that for the algorithm described by (2) and (3)
with $c^s_t = 1 - \eta\lambda$, $c^c_t = 1$ it holds that
$\|a^s_t\|^2 = \|a_t\|^2 (1 - \eta\lambda)^{2(t-1)} \le \left(\eta R^2 + 2(1 - \eta\lambda) b\right) / \left(\lambda (2 - \eta\lambda)\right)$.
Thus, the well-known fact that constant shrinking leads to bounded length of the
weight vector is confirmed in this context.

To proceed with our analysis of the margin perceptron with variable shrinking 
we need some inequalities involving sums of powers of integers which we present 
in the form of lemmas. Their proofs can be found in the Appendix. 

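Lemma 1. Let $n \ge 0$ be an integer. Then, it holds that

$$(n + 1) \sum_{k=1}^{t} k^n \le t (t + 1)^n . \qquad (20)$$

Lemma 2. Let $n \ge 0$ be an integer. Then, it holds that

$$(n + 1) \sum_{k=1}^{t} k^n \ge (t + 1)^{n+1} - \frac{(n + 1)^2}{2n + 1} (t + 1)^n . \qquad (21)$$

Lemma 3. Let $n \ge 0$ be an integer. Then, it holds that

$$(2n + 1) \, t \sum_{k=1}^{t} k^{2n} \le (n + 1)^2 \left(\sum_{k=1}^{t} k^n\right)^2 . \qquad (22)$$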
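The three inequalities are easy to spot-check numerically. The Python sketch below assumes the forms $(n+1)\sum_{k=1}^{t} k^n \le t(t+1)^n$ for (20), $(n+1)\sum_{k=1}^{t} k^n \ge (t+1)^{n+1} - \frac{(n+1)^2}{2n+1}(t+1)^n$ for (21) and $(2n+1)\, t \sum_{k=1}^{t} k^{2n} \le (n+1)^2 \big(\sum_{k=1}^{t} k^n\big)^2$ for (22), and uses exact rational arithmetic. It is a sanity check over a finite range, not a substitute for the proofs in the Appendix; the function name is hypothetical.

```python
from fractions import Fraction

def check_lemmas(n, t):
    """Return the truth values of (20), (21) and (22) for the given n and t."""
    s_n = sum(k ** n for k in range(1, t + 1))
    s_2n = sum(k ** (2 * n) for k in range(1, t + 1))
    lemma1 = (n + 1) * s_n <= t * (t + 1) ** n
    lemma2 = (n + 1) * s_n >= (t + 1) ** (n + 1) - Fraction((n + 1) ** 2, 2 * n + 1) * (t + 1) ** n
    lemma3 = (2 * n + 1) * t * s_2n <= (n + 1) ** 2 * s_n ** 2
    return lemma1, lemma2, lemma3

# Exhaustive check over a small grid of exponents n and numbers of updates t.
all_hold = all(all(check_lemmas(n, t)) for n in range(0, 8) for t in range(1, 60))
print(all_hold)
```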

Now we are ready to move on with the analysis of the variable shrinking case. 

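Theorem 2. The margin perceptron with variable shrinking converges in a finite
number $t_c$ of updates satisfying the bound

$$t_c \le t_b \equiv \frac{(n + 1)^2}{2n + 1} \left(1 + \frac{2b}{\eta R^2}\right) \frac{R^2}{\gamma_d^2} . \qquad (23)$$

Moreover, the zero-threshold solution hyperplane possesses margin $\gamma'_d$ which is a
fraction $f$ of the maximum margin $\gamma_d$ obeying the inequality

$$f \equiv \frac{\gamma'_d}{\gamma_d} \ge \frac{2n + 1}{n + 1} \, \frac{b}{\eta R^2 + 2b} . \qquad (24)$$

Finally, an after-run lower bound on $f$ involving the margin $\gamma'_d$ achieved, the
length $\|a_{t_c}\|$ of the solution weight vector $a_{t_c}$ and the number $t_c$ of updates is
given by

$$f \ge \eta \sum_{k=1}^{t_c} k^n \, \frac{\gamma'_d}{\|a_{t_c}\|} . \qquad (25)$$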

Proof. Taking the inner product of (6) with the optimal direction $u$ and using
(1) we get

$$u \cdot a_{t+1} - u \cdot a_t = \eta (t + 1)^n \, u \cdot y_k \ge \eta (t + 1)^n \gamma_d$$

a repeated application of which, taking into account that $a_0 = 0$, gives

$$\|a_t\| \ge u \cdot a_t \ge \eta \gamma_d \sum_{k=1}^{t} k^n . \qquad (26)$$

From (6) and (7) we obtain

$$\|a_{t+1}\|^2 - \|a_t\|^2 = \eta^2 (t + 1)^{2n} \|y_k\|^2 + 2\eta (t + 1)^n \, a_t \cdot y_k \le (\eta^2 R^2 + 2\eta b)(t + 1)^{2n} .$$

A repeated application of the above inequality, assuming $a_0 = 0$, leads to

$$\|a_t\|^2 \le (\eta^2 R^2 + 2\eta b) \sum_{k=1}^{t} k^{2n} . \qquad (27)$$

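Combining (26) and (27) we obtain

$$\eta^2 \gamma_d^2 \left(\sum_{k=1}^{t} k^n\right)^2 \le \|a_t\|^2 \le (\eta^2 R^2 + 2\eta b) \sum_{k=1}^{t} k^{2n} \qquad (28)$$

or

$$t \le \frac{\eta R^2 + 2b}{\eta \gamma_d^2} \left(t \sum_{k=1}^{t} k^{2n}\right) \left(\sum_{k=1}^{t} k^n\right)^{-2}$$

which by virtue of (22) gives (23).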

Upon convergence of the algorithm in $t_c$ updates condition (7) is violated by
all patterns. Therefore, the margin $\gamma'_d$ achieved satisfies
$\gamma'_d > b (t_c + 1)^n / \|a_{t_c}\|$. Thus,

$$f^2 = \frac{\gamma'^2_d}{\gamma_d^2} > \frac{b^2 (t_c + 1)^{2n}}{\gamma_d^2 \|a_{t_c}\|^2} \ge \frac{b^2 (t_c + 1)^{2n}}{\gamma_d^2 (\eta^2 R^2 + 2\eta b) \sum_{k=1}^{t_c} k^{2n}} \ge \frac{(2n + 1) b^2}{\gamma_d^2 (\eta R^2 + 2b) \eta t_c} . \qquad (29)$$

Here we replaced $\|a_{t_c}\|^2$ with its upper bound $(\eta^2 R^2 + 2\eta b) \sum_{k=1}^{t_c} k^{2n}$ from (27)
and $\sum_{k=1}^{t_c} k^{2n}$ with its upper bound $t_c (t_c + 1)^{2n} / (2n + 1)$ from (20).
Overapproximating $t_c$ by $t_b$ in (29) and substituting the value of the latter from
(23) we get

$$f^2 > \frac{(2n + 1) b^2}{\gamma_d^2 (\eta R^2 + 2b) \eta t_b} = \left(\frac{2n + 1}{n + 1} \, \frac{b}{\eta R^2 + 2b}\right)^2$$

from where by taking the square root we obtain (24).

Finally, (25) is readily obtained if in the ratio $\gamma'_d / \gamma_d$ we employ the upper
bound on $\gamma_d$ derivable from (26). □

Remark 5. Let us define $\delta \equiv \eta R^2 / b$ and $\epsilon \equiv (n + 1)^{-1}$. Then, (23) and (24)
become

$$t_c \le \frac{1}{\epsilon \delta} \left(\frac{1 + \delta/2}{1 - \epsilon/2}\right) \frac{R^2}{\gamma_d^2}$$

and

$$f \ge \frac{1 - \epsilon/2}{1 + \delta/2} ,$$

respectively. The perceptron with margin corresponds to $n = 0$ or $\epsilon = 1$. If we
choose $\delta \ll 1$ while keeping $\epsilon$ (i.e., $n$) fixed the upper bound on the number of
updates becomes $O(\delta^{-1} R^2 / \gamma_d^2)$ and the before-run lower bound on $f$ approaches
as $\delta \to 0$ the value $1 - \epsilon/2$. Then, by allowing $\epsilon \to 0$ (i.e., $n \to \infty$) maximum
margin solutions are approximated. If we set, instead, $\delta = \epsilon \ll 1$ then $f \ge 1 - \epsilon$
and the algorithm achieves "accuracy" $\epsilon$ in at most
$\epsilon^{-2} R^2 / \gamma_d^2 + O(\epsilon^{-1} R^2 / \gamma_d^2)$ updates. This is among the best bounds of
perceptron-like approximate maximum margin classifiers which do not assume
knowledge of the value of $\gamma_d$ in any way. For comparison, ALMA's bound is
$\sim 8 \epsilon^{-2} R^2 / \gamma_d^2$.
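The two forms of the variable-shrinking guarantees can be cross-checked numerically. The Python sketch below assumes that (23) reads $t_c \le \frac{(n+1)^2}{2n+1}\left(1 + \frac{2b}{\eta R^2}\right)\frac{R^2}{\gamma_d^2}$ and that (24) reads $f \ge \frac{2n+1}{n+1}\frac{b}{\eta R^2 + 2b}$, as derived in the proof of Theorem 2. The function name and the numbers are hypothetical.

```python
import math

def mpvs_bounds(n, eta, b, R, gamma_d):
    """Return (upper bound on updates, lower bound on f) for variable shrinking."""
    t_max = (n + 1) ** 2 / (2 * n + 1) * (1.0 + 2.0 * b / (eta * R ** 2)) * (R / gamma_d) ** 2
    f_min = (2 * n + 1) / (n + 1) * b / (eta * R ** 2 + 2.0 * b)
    return t_max, f_min

# Hypothetical values: n = 3, b = R^2 = 8, eta = 0.1, gamma_d = 3 / sqrt(2).
n, eta, b = 3, 0.1, 8.0
R, gamma_d = math.sqrt(8.0), 3.0 / math.sqrt(2.0)
t_max, f_min = mpvs_bounds(n, eta, b, R, gamma_d)

# The same quantities in terms of delta = eta R^2 / b and eps = 1 / (n + 1).
delta, eps = eta * R ** 2 / b, 1.0 / (n + 1)
t_alt = (1.0 + delta / 2.0) / (1.0 - eps / 2.0) * (R / gamma_d) ** 2 / (eps * delta)
f_alt = (1.0 - eps / 2.0) / (1.0 + delta / 2.0)
print(t_max, f_min)
```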

Remark 6. Given that $f^2 \le 1$ (29) leads to a lower bound on the number $t_c$ of
updates required for convergence of the margin perceptron with variable shrinking
which in terms of the parameters $\delta$ and $\epsilon$ reads

$$t_c \ge \frac{1}{\epsilon \delta} \left(\frac{1 - \epsilon/2}{1 + \delta/2}\right) \frac{R^2}{\gamma_d^2} .$$

As $\delta, \epsilon \to 0$ the ratio of the above lower bound to the upper bound tends to 1
and the algorithm approaches the optimal solution in $\sim (\epsilon \delta)^{-1} R^2 / \gamma_d^2$ updates.

Remark 7. Theorems 1 and 2 hold also for the algorithms described by (2) and
(3) as appropriate provided, of course, that $\|a_{t_c}\|$ is replaced in (10) and (25)
with $\|a^s_{t_c}\|$ by making use of the relation connecting these two quantities.

Remark 8. The after-run lower bounds on $f$ given by (10) and (25) typically
provide estimates of the margin achieved which are much more accurate than
the ones obtained from the before-run bounds of (9) and (24), respectively. Our
experience based on such estimates suggests that a satisfactory approximation
of the maximum margin solution can be obtained without the need to resort to
very small values of the parameter $\epsilon$. In other words, although the theoretically
guaranteed before-run fraction of the maximum margin for $\delta \ll 1$ is close to
$1 - \epsilon/2$ both the estimated after-run fraction and the one actually achieved are
larger. This is a generic feature of the perceptron with margin and its
generalizations. It turns out that in most cases $\epsilon \sim 0.2 - 0.3$ is sufficiently small
for the algorithm to obtain for $\delta \ll 1$ solutions possessing 99% of the maximum
margin. Thus, for constant shrinking a very accurate knowledge of the value of
$\gamma_d$ is not required while for variable shrinking very low values of $n$ are sufficient.



4 Implementation and Experiments 

To reduce the computational cost we adopt a two-member nested sequence of 
reduced "active sets" of data points as described in detail in [15]. The parameter 
c which multiplies the threshold of the misclassification condition when this 
condition is used to select the points of the first-level active set is given the 
value c = 1.01. The parameters, instead, determining the number of times the 
active sets are presented to the algorithm are set to the values $N_{ep_1} = N_{ep_2} = 5$.

An additional mechanism providing a substantial improvement of the com- 
putational efficiency is the one of performing multiple updates [14-16] once a 
data point is presented to the algorithm. It is understood, of course, that a mul- 
tiple update should be equivalent to a certain number of updates occurring as 
a result of repeatedly presenting to the algorithm the data point in question. 
Thus, the maximal multiplicity of such an update will be determined by the 
requirement that the pattern y k which satisfies the misclassification condition 
will just violate it as a result of the multiple update. For constant shrinking a
multiple update is

$$a_{t+\mu} = a_t + \frac{1 - (1 - \eta\lambda)^{\mu}}{\lambda (1 - \eta\lambda)^{t + \mu - 1}} \, y_k$$

with

$$\mu = \left[\frac{1}{\ln (1 - \eta\lambda)^{-1}} \ln \frac{\|y_k\|^2 - \lambda (1 - \eta\lambda)^{t-1} a_t \cdot y_k}{\|y_k\|^2 - \lambda b}\right] + 1 .$$

Here $[x]$ is the integer part of $x \ge 0$. For variable shrinking, instead, finding the
maximal multiplicity of the update involves solving a $(n+1)$-th degree equation
for which there is no general formula unless $n \le 3$. However, this does not pose
a serious problem for several reasons. First of all, as we have already pointed out
in Remark 8, we may reach very good approximations of the maximal margin
hyperplane with low values of $n$. In addition, even if we choose a larger $n$ we may
obtain satisfactory performance with updates having multiplicity lower than the
maximal one. Thus, it suffices to find a lower bound on the relevant root of the
$(n+1)$-th degree equation. Moreover, even when the exact root is available it is
often preferable to set an upper bound $\mu_{up}$ on the multiplicity of the updates.
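For constant shrinking the multiple update can be checked against repeated single updates. The Python sketch below works in the original shrunk variables $a^s_t = (1 - \eta\lambda)^{t-1} a_t$, in which $\mu$ consecutive updates on the same pattern give $a^s \to (1 - \eta\lambda)^{\mu} a^s + \frac{1 - (1 - \eta\lambda)^{\mu}}{\lambda} y_k$, and takes $\mu$ as the smallest number of updates after which $a^s \cdot y_k > b$. The function names and numerical values are hypothetical, and $\|y_k\|^2 > \lambda b$ is assumed.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def multiplicity(a_s, y, eta, lam, b):
    """Smallest number of repeated updates on y after which a^s . y > b.

    Assumes dot(a_s, y) <= b (a margin error) and dot(y, y) > lam * b.
    """
    y2 = dot(y, y)
    x = math.log((y2 - lam * dot(a_s, y)) / (y2 - lam * b)) / -math.log(1.0 - eta * lam)
    return math.floor(x) + 1

def multiple_update(a_s, y, eta, lam, mu):
    """Closed form of mu consecutive updates a^s <- (1 - eta lam) a^s + eta y."""
    q_mu = (1.0 - eta * lam) ** mu
    return [q_mu * ai + (1.0 - q_mu) / lam * yi for ai, yi in zip(a_s, y)]

# Hypothetical example values.
eta, lam, b = 0.1, 0.28125, 8.0
a_s, y = [0.0, 0.0], [2.0, 1.0]
mu = multiplicity(a_s, y, eta, lam, b)
a_multi = multiple_update(a_s, y, eta, lam, mu)

# Reference: apply single updates while the misclassification condition holds.
a_ref, steps = list(a_s), 0
while dot(a_ref, y) <= b:
    a_ref = [(1.0 - eta * lam) * ai + eta * yi for ai, yi in zip(a_ref, y)]
    steps += 1
print(mu, steps)
```

A cut-off on the multiplicity, as mentioned above, would amount to replacing `mu` by `min(mu, mu_up)` before calling `multiple_update`.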

The aim of our experiments is to assess the ability of the margin perceptron
with constant shrinking (MPCS) and the margin perceptron with variable
shrinking (MPVS) to achieve fast convergence to a certain approximation of the
optimal solution in the feature space where the patterns are linearly separable.
For linearly separable data the feature space is the initial instance space. For
inseparable data, instead, a space extended by $m$ dimensions, as many as the
instances, is considered where each instance is placed at a distance $\Delta$ from the
origin in the corresponding dimension² [5]. This extension generates a margin of
at least $\Delta/\sqrt{m}$ and its employment relies on the well-known equivalence between
the hard margin optimization in the extended space and the soft margin
optimization in the initial instance space with objective function
$\|w\|^2 + \Delta^{-2} \sum_i \xi_i^2$ involving the weight vector $w$ and the 2-norm of the
slacks $\xi_i$ [2].

Table 1. Results of experiments with the algorithms MPCS and MPVS.

                                   MPCS                                      MPVS n = 3
data set   #inst   #attr   Δ    10^7 λ   10^4 η   10^4 γ'_d   f ≥     s      10^4 η   10^4 γ'_d   f ≥     s
Adult      32561   123     1    34       4.5      84.53       0.990   1.0    20       84.57       0.990   0.9
Web        49749   300     1    29       7.5      209.3       0.987   0.6    16       209.5       0.992   0.5
Physics    50000   70      1    0.75     114      44.56       0.991   3.9    …  4.5
News20     …  0.1  …  3.8  …  14.2
Covertype  …  47.49  0.990  22.3  …  47.50  0.990  27.9
…          …  0.3  …  10.04  0.988  34.9  50  10.04  0.988  34.6
…          …  47236  0.3  6.9  …  13.80  0.989  41.1  25  13.79  0.987  41.6
…          …  804414  47236  …  3.1  5.1  9.271  0.989  64.0  30  9.270  0.989  73.2
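The extension of the instance space used for inseparable data can be sketched in Python. The sketch assumes that each extended pattern is the label-multiplied pattern followed by $m$ extra coordinates, of which only the $k$-th is non-zero and equal to $l_k \Delta$. The function name `extend` and the toy data are hypothetical.

```python
import math

def extend(Y_bar, labels, Delta):
    """Append m extra coordinates: pattern k gets l_k * Delta in coordinate k, zero elsewhere."""
    m = len(Y_bar)
    return [list(y) + [labels[k] * Delta if j == k else 0.0 for j in range(m)]
            for k, y in enumerate(Y_bar)]

# Hypothetical inseparable 1-d data, already multiplied by the labels:
# no single weight can make all three products positive.
labels = [+1, -1, +1]
Y_bar = [[1.0], [1.0], [-1.0]]
Delta = 0.5
Y_ext = extend(Y_bar, labels, Delta)

# The unit vector that vanishes on the original coordinates and equals
# l_k / sqrt(m) on the extra ones separates the extended patterns.
m = len(Y_ext)
u = [0.0] * len(Y_bar[0]) + [l / math.sqrt(m) for l in labels]
margins = [sum(ui * yi for ui, yi in zip(u, y)) for y in Y_ext]
print(margins)
```

Every extended pattern has margin exactly $\Delta/\sqrt{m}$ with respect to this particular direction, which is why the maximum margin in the extended space is at least that large.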

In the experiments the augmentation parameter $\rho$ was set to the value $\rho = 1$.
The values of the parameter $\Delta$ together with the number of instances and
attributes of the datasets used are given in Table 1. Further details may be found
in [16]. The experiments, like the ones of [16], were conducted on a 2.5 GHz Intel
Core 2 Duo processor with 3 GB RAM running Windows Vista. Therefore, the
runtimes reported here can be directly compared to the ones of [16]. Our codes
written in C++ were compiled using the g++ compiler under Cygwin. They are
available at http://users.auth.gr/costapan.

In the numerical experiments the results of which we report in Table 1 the
algorithms MPCS and MPVS were required to obtain solutions possessing 99%
of the maximum margin $\gamma_d$. Additionally, we imposed a cut-off value
$\mu_{up} = 1000$ on the multiplicity of the updates. We set $b = R^2$ such that
$\delta = \eta R^2 / b = \eta$ for both algorithms. For MPCS assuming knowledge of $\gamma_d$ we
chose $\lambda \simeq 0.75 \gamma_d^2 / b$ such that $\epsilon \simeq 0.25$. In the case of MPVS we set $n = 3$
giving $\epsilon = (n + 1)^{-1} = 0.25$. Thus, for both algorithms the asymptotic value of the
theoretically guaranteed fraction of $\gamma_d$ that they were able to achieve in the limit
$\delta \to 0$ was $1 - \epsilon/2 \simeq 0.875$. The lower bound on the fraction $f$ reported is the
after-run bound of (10) and (25) which turns out in most cases to be $\sim 0.99$ and
certainly much larger than the before-run fraction $\sim 0.875$ in accordance with our
earlier discussion in Remark 8. The required value of the margin was achieved by
sufficiently lowering the value of $\eta$ having knowledge of the target value. However,
even if such a knowledge were not available we could have reached our goal guided
by the after-run lower bound on $f$. From Table 1 we see that the runtimes (in
seconds) of MPCS and MPVS for the same value $\gamma'_d$ of the margin achieved are
comparable. More important, though, is a comparison with the results obtained
with other large margin classifiers as reported in [16]. We see that MPCS and
MPVS are orders of magnitude faster than ROMMA and SVM$^{light}$ [7], faster
than PDM and of comparable speed or at most about 2 times slower than the
linear SVM algorithms DCD [9] and MPU [14]. We should note, however, that
unlike our algorithms linear SVMs are not primal and strictly online.

² $y_k = [\bar{y}_k, l_k \Delta \delta_{1k}, \ldots, l_k \Delta \delta_{mk}]$, where $\delta_{ij}$ is Kronecker's $\delta$ and $\bar{y}_k$ the projection
of the $k$-th extended instance $y_k$ (multiplied by its label $l_k$) onto the initial instance
space. The feature space mapping defined by the extension commutes with a possible
augmentation (with parameter $\rho$) in which case $\bar{y}_k = [l_k \bar{x}_k, l_k \rho]$. Here $\bar{x}_k$ represents
the $k$-th data point.

Finally, we would like to point out that in practice it is possible to set at one
stage the parameter $\lambda$ of MPCS without prior knowledge of the value of $\gamma_d$. A
preliminary run of MPCS with an almost vanishing $\lambda$ provides a lower bound on
$\gamma_d$ which is the margin $\gamma'_d$ achieved and an upper bound from the after-run lower
bound on $f$. Actually, $\gamma_d$ usually lies closer to its upper bound. This information
is sufficient to choose $\lambda$ given that the algorithm is not extremely sensitive to
this choice provided, of course, that $\lambda$ remains below its maximal allowed value.

5 Conclusions 

Motivated by the presence of weight shrinking in most attempts at solving the 
L1-SVM problem via stochastic gradient descent we introduced this feature into
the classical perceptron algorithm with margin. In the case of constant weight
decay parameter $\lambda$ and constant learning rate we demonstrated that convergence
to solutions with approximately maximum margin requires $\lambda$ to approach
a margin-dependent maximal allowed value. Scenarios with variable shrinking 
strength were also considered and proven not to be subject to such limitations. 
The theoretical analysis was corroborated by an experimental investigation with 
massive datasets which involved searching for large margin solutions in an ex- 
tended feature space, a problem equivalent to the 2-norm soft margin one. As 
a final conclusion of our study we may say that shrinking of the current weight 
vector as a first step of the update is able to elevate the margin perceptron to a 
very effective primal online large margin classifier. 
A Proof of Lemma 1

Proof. We proceed by induction in the integer $t$. For $t = 1$ inequality (20)
reduces to $(n + 1) \le 2^n$ which holds since $2^n = (1 + 1)^n \ge 1 + n$. Now let
us assume that (20) holds and prove that
$(n + 1) \sum_{k=1}^{t+1} k^n \le (t + 1)(t + 2)^n$ or
$(n + 1) \left((t + 1)^n + \sum_{k=1}^{t} k^n\right) \le (t + 1)(t + 2)^n$. Given that (20) holds it
suffices to prove that $(n + 1)(t + 1)^n + t(t + 1)^n \le (t + 1)(t + 2)^n$ or that
$(t + 2)^n \ge (t + 1)^{n-1}(n + 1 + t)$. Indeed,
$(t + 2)^n = (t + 1)^n \left(1 + (t + 1)^{-1}\right)^n \ge (t + 1)^n \left(1 + n(t + 1)^{-1}\right) = (t + 1)^{n-1}(t + 1 + n)$. □



B Proof of Lemma 2 

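Proof. We proceed by induction in the integer $t$. For $t = 1$ inequality (21) reduces
to $(n + 1)(2n + 1) \ge 2^n \left(1 - n(n - 2)\right)$ which holds $\forall n \ge 0$. Now let us assume
that (21) holds and prove that
$\sum_{k=1}^{t+1} k^n \ge \frac{1}{n+1} (t + 2)^{n+1} - \frac{n+1}{2n+1} (t + 2)^n$. Using (21) we have

$$\sum_{k=1}^{t+1} k^n = (t + 1)^n + \sum_{k=1}^{t} k^n \ge (t + 1)^n + \frac{1}{n+1} (t + 1)^{n+1} - \frac{n+1}{2n+1} (t + 1)^n = \frac{1}{n+1} (t + 1)^{n+1} + \frac{n}{2n+1} (t + 1)^n .$$

Thus, it suffices to prove that

$$F(t) \equiv \frac{n+1}{2n+1} (t + 2)^n + \frac{n}{2n+1} (t + 1)^n - \frac{1}{n+1} \left((t + 2)^{n+1} - (t + 1)^{n+1}\right) \ge 0$$

or that $F(t)/t^n \ge 0$. By virtue of the binomial formula $F(t)/t^n$ admits the expansion

$$\frac{F(t)}{t^n} = \sum_{l=0}^{n} \frac{n!}{l! (n - l)!} \left(\frac{(n + 1) 2^l + n}{2n + 1} - \frac{2^{l+1} - 1}{l + 1}\right) t^{-l} .$$

Given that
$\left((n + 1) 2^l + n\right)(l + 1) - (2^{l+1} - 1)(2n + 1) = \left((l - 3) 2^l + l + 3\right) n + (l - 1) 2^l + 1 \ge 0 \;\; \forall l \ge 0$
the terms in the above expansion are all non-negative implying $F(t)/t^n \ge 0$. □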



C Proof of Lemma 3 

Proof. We proceed by induction in the integer $t$. For $t = 1$ inequality (22) reduces
to $2n + 1 \le (n + 1)^2$ which obviously holds $\forall n \ge 0$. Now let us assume that (22)
holds and prove that
$(2n + 1)(t + 1) \sum_{k=1}^{t+1} k^{2n} \le (n + 1)^2 \left(\sum_{k=1}^{t+1} k^n\right)^2$. Using (22) we have

$$(2n + 1)(t + 1) \sum_{k=1}^{t+1} k^{2n} = (2n + 1)(t + 1) \left((t + 1)^{2n} + \sum_{k=1}^{t} k^{2n}\right) = (2n + 1)(t + 1)^{2n+1} + \frac{t+1}{t} (2n + 1) \, t \sum_{k=1}^{t} k^{2n}$$

$$\le (2n + 1)(t + 1)^{2n+1} + \frac{t+1}{t} (n + 1)^2 \left(\sum_{k=1}^{t} k^n\right)^2 .$$

Also

$$(n + 1)^2 \left(\sum_{k=1}^{t+1} k^n\right)^2 = (n + 1)^2 \left((t + 1)^n + \sum_{k=1}^{t} k^n\right)^2 = (n + 1)^2 (t + 1)^{2n} + (n + 1)^2 \left(\sum_{k=1}^{t} k^n\right)^2 + 2 (n + 1)^2 (t + 1)^n \sum_{k=1}^{t} k^n .$$

Thus, it suffices to prove that

$$(2n + 1)(t + 1)^{2n+1} + \frac{t+1}{t} (n + 1)^2 \left(\sum_{k=1}^{t} k^n\right)^2 \le (n + 1)^2 (t + 1)^{2n} + (n + 1)^2 \left(\sum_{k=1}^{t} k^n\right)^2 + 2 (n + 1)^2 (t + 1)^n \sum_{k=1}^{t} k^n$$

or, equivalently, that

$$(n + 1) \left(2 (n + 1) t (t + 1)^n - (n + 1) \sum_{k=1}^{t} k^n\right) \sum_{k=1}^{t} k^n + (n + 1)^2 t (t + 1)^{2n} - (2n + 1) t (t + 1)^{2n+1} \ge 0 .$$

Replacing in the above inequality $(n + 1) \sum_{k=1}^{t} k^n$ with its upper bound $t (t + 1)^n$
from (20) we end up with the inequality

$$(n + 1)(2n + 1) t (t + 1)^n \sum_{k=1}^{t} k^n + (n + 1)^2 t (t + 1)^{2n} - (2n + 1) t (t + 1)^{2n+1} \ge 0$$

to prove which, however, is equivalent to (21). □



